Tree-Based Algorithms (Decision Trees, Random Forests, Gradient Boosting)#

Tree models are some of the most practical “workhorse” algorithms in machine learning:

  • Decision trees are interpretable if-else rules learned from data.

  • Random forests reduce variance by averaging many trees.

  • Gradient-boosted trees reduce bias by adding trees sequentially to fix mistakes.

A simple way to remember the core idea:

A tree is like playing 20 Questions with your data.

Each question splits the world into two parts. You want questions that quickly reduce uncertainty.

This notebook builds intuition with visual, Plotly-based demos, then implements simplified versions from scratch in NumPy, and finally connects everything to scikit-learn (and the modern boosting libraries).


Learning goals#

By the end you should be able to:

  • explain how decision trees work for classification and regression

  • use and compare Gini impurity vs entropy / information gain

  • implement a small CART-style tree from scratch (NumPy)

  • explain and implement bagging and a simplified random forest

  • understand what gradient boosting is doing (and how it relates to XGBoost / LightGBM / CatBoost)

Notation#

  • Dataset: \((x_i, y_i)\) for \(i=1..n\)

  • Feature matrix: \(X \in \mathbb{R}^{n \times d}\)

  • Targets:

    • classification: \(y \in \{1,\dots,K\}\)

    • regression: \(y \in \mathbb{R}\)


Table of contents#

  1. Decision trees: intuition

  2. Impurity measures (Gini, entropy, variance)

  3. A split in slow motion (Plotly)

  4. Decision trees from scratch (classification + regression)

  5. scikit-learn trees + key hyperparameters

  6. Ensembles: bagging and random forests (from scratch + sklearn)

  7. Gradient boosting and modern libraries (XGBoost / LightGBM / CatBoost)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from dataclasses import dataclass

from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Decision trees: intuition#

A decision tree is a sequence of questions.

Example (classification):

  • “Is x1 ≤ 0.7?”

    • yes → “Is x2 ≤ -0.2?” → class A

    • no → class B

Example (regression):

  • “Is area ≤ 80?” → predict $220k

  • else “Is area ≤ 120?” → predict $310k

  • else → predict $420k

A tree doesn’t learn a smooth curve. It learns a set of regions, and predicts a constant value per region.
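To make "a constant value per region" concrete, here is the house-price example written out as the nested if/else rules a tree effectively learns (the thresholds and prices mirror the toy example above; this is illustrative, not a fitted model):

```python
# A tree is just nested if/else rules producing one constant per region.
def tiny_house_price_tree(area: float) -> float:
    """Hand-written 'tree' for the area -> price example (prices in $k)."""
    if area <= 80:
        return 220.0
    elif area <= 120:
        return 310.0
    else:
        return 420.0

for a in [60, 100, 150]:
    print(a, "->", tiny_house_price_tree(a))
```

Note the prediction jumps at the thresholds; this is exactly the "blocky" piecewise-constant behavior we will see in the regression plots below.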

2) Impurity measures (how to pick a good question)#

At any node, the data is a “bag” of samples.

A good split makes the children bags purer.

Classification#

Let \(p_k\) be the fraction of class \(k\) in the node.

  • Gini impurity:

\[ G = 1 - \sum_{k=1}^K p_k^2 \]
  • Entropy:

\[ H = -\sum_{k=1}^K p_k\log_2(p_k) \]

Both measure “messiness”. Both are 0 when the node is perfectly pure.

A common split score is information gain:

\[ \text{Gain} = I(\text{parent}) - \frac{n_L}{n}I(L) - \frac{n_R}{n}I(R) \]

where \(I\) is either Gini or entropy.

Regression#

A common impurity is variance (equivalently MSE around the mean):

\[ V = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2 \]

Splits are chosen to reduce the weighted variance.

def class_counts(y: np.ndarray) -> np.ndarray:
    y = np.asarray(y)
    if y.size == 0:
        return np.array([], dtype=int)
    _, counts = np.unique(y, return_counts=True)
    return counts


def gini_impurity(y: np.ndarray) -> float:
    counts = class_counts(y)
    if counts.size == 0:
        return 0.0
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def entropy(y: np.ndarray) -> float:
    counts = class_counts(y)
    if counts.size == 0:
        return 0.0
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def variance(y: np.ndarray) -> float:
    y = np.asarray(y, dtype=float)
    if y.size == 0:
        return 0.0
    return float(np.mean((y - y.mean()) ** 2))


def information_gain(y: np.ndarray, y_left: np.ndarray, y_right: np.ndarray, impurity_fn) -> float:
    n = y.size
    if n == 0:
        return 0.0
    n_left = y_left.size
    n_right = y_right.size
    if n_left == 0 or n_right == 0:
        return 0.0

    parent = impurity_fn(y)
    child = (n_left / n) * impurity_fn(y_left) + (n_right / n) * impurity_fn(y_right)
    return float(parent - child)


# Quick sanity check
print('gini([A,A,B,B]) =', gini_impurity(np.array([0, 0, 1, 1])))
print('entropy([A,A,B,B]) =', entropy(np.array([0, 0, 1, 1])))
gini([A,A,B,B]) = 0.5
entropy([A,A,B,B]) = 1.0

3) A split in slow motion (1D classification)#

We’ll create a 1D dataset and evaluate “How good is the split at threshold \(t\)?”

Think of the threshold as a question:

“Is \(x \le t\)?”

We’ll plot information gain for both Gini and entropy.

# 1D binary classification dataset
n = 140
x_1d = np.concatenate([
    rng.normal(loc=-1.2, scale=0.7, size=n // 2),
    rng.normal(loc=1.0, scale=0.8, size=n // 2),
])
y_1d = np.array([0] * (n // 2) + [1] * (n // 2))

# Shuffle
perm = rng.permutation(n)
x_1d = x_1d[perm]
y_1d = y_1d[perm]

fig = px.scatter(
    x=x_1d,
    y=y_1d,
    title="1D classification dataset (points are classes)",
    labels={"x": "x", "y": "class"},
)
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()
# Candidate thresholds: midpoints between sorted unique x values
x_sorted = np.sort(np.unique(x_1d))
thresholds = (x_sorted[:-1] + x_sorted[1:]) / 2.0

ig_gini = []
ig_entropy = []

for t in thresholds:
    left = x_1d <= t
    y_left = y_1d[left]
    y_right = y_1d[~left]

    ig_gini.append(information_gain(y_1d, y_left, y_right, gini_impurity))
    ig_entropy.append(information_gain(y_1d, y_left, y_right, entropy))

ig_gini = np.array(ig_gini)
ig_entropy = np.array(ig_entropy)

best_t_gini = float(thresholds[np.argmax(ig_gini)])
best_t_entropy = float(thresholds[np.argmax(ig_entropy)])

best_t_gini, best_t_entropy
(-0.3957878668246275, -0.3957878668246275)
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=ig_gini, mode="lines", name="Information gain (Gini)"))
fig.add_trace(go.Scatter(x=thresholds, y=ig_entropy, mode="lines", name="Information gain (Entropy)"))
fig.add_vline(x=best_t_gini, line_dash="dash", line_color="blue", annotation_text="best (Gini)")
fig.add_vline(x=best_t_entropy, line_dash="dash", line_color="orange", annotation_text="best (Entropy)")
fig.update_layout(
    title="How good is a split? Information gain vs threshold",
    xaxis_title="threshold t",
    yaxis_title="information gain",
)
fig.show()

What did the “best split” actually do?#

Let’s look at the before/after impurity numbers at the best Gini threshold.

This makes the abstract formula feel more concrete:

  • parent node: one mixed bag

  • left/right nodes: two less-mixed bags

# Inspect impurity before/after the best split (Gini)
left = x_1d <= best_t_gini

g_parent = gini_impurity(y_1d)
g_left = gini_impurity(y_1d[left])
g_right = gini_impurity(y_1d[~left])

gain = information_gain(y_1d, y_1d[left], y_1d[~left], gini_impurity)

counts_parent = np.bincount(y_1d, minlength=2)
counts_left = np.bincount(y_1d[left], minlength=2)
counts_right = np.bincount(y_1d[~left], minlength=2)

print('counts parent:', counts_parent, 'gini:', round(g_parent, 4))
print('counts left  :', counts_left, 'gini:', round(g_left, 4))
print('counts right :', counts_right, 'gini:', round(g_right, 4))
print('information gain:', round(gain, 4))
counts parent: [70 70] gini: 0.5
counts left  : [67  0] gini: 0.0
counts right : [ 3 70] gini: 0.0788
information gain: 0.4589
# Visualize the split on the 1D data
jitter = rng.normal(0, 0.03, size=y_1d.size)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_1d, y=y_1d + jitter, mode='markers', name='data'))
fig.add_vline(x=best_t_gini, line_dash='dash', line_color='black', annotation_text='best split')
fig.update_layout(
    title='The best split threshold on the 1D dataset',
    xaxis_title='x',
    yaxis_title='class (jittered)'
)
fig.show()

4) Decision trees from scratch (NumPy)#

Below we implement a simplified, educational version of a CART-style tree:

  • continuous features

  • binary splits of the form \(x_j \le t\)

  • greedy choice: pick the best (feature, threshold) at each node

This is not optimized for speed; it’s meant to be readable.

4.1 Decision tree classifier (Gini or entropy)#

Stopping rules (common “pre-pruning” controls):

  • max_depth

  • min_samples_split

  • min_samples_leaf

Random forests additionally need max_features: at each split, only a random subset of features is considered.

@dataclass
class Node:
    feature_index: int | None = None
    threshold: float | None = None
    left: "Node | None" = None
    right: "Node | None" = None
    value: object | None = None  # predicted class (classifier) or mean value (regressor)
    proba: np.ndarray | None = None  # class probabilities for classifier


def _resolve_max_features(max_features, n_features: int) -> int:
    if max_features is None:
        return n_features
    if isinstance(max_features, int):
        return max(1, min(n_features, max_features))
    if isinstance(max_features, float):
        return max(1, min(n_features, int(np.ceil(max_features * n_features))))
    if isinstance(max_features, str):
        key = max_features.lower()
        if key == "sqrt":
            return max(1, int(np.sqrt(n_features)))
        if key == "log2":
            return max(1, int(np.log2(n_features)))
    raise ValueError(f"Unsupported max_features={max_features!r}")


class DecisionTreeClassifierScratch:
    def __init__(
        self,
        *,
        criterion: str = "gini",
        max_depth: int | None = None,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        max_features=None,
        random_state: int = 0,
    ):
        self.criterion = criterion
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state

        self.root_: Node | None = None
        self.classes_: np.ndarray | None = None
        self._rng = np.random.default_rng(random_state)

    def _impurity(self, y: np.ndarray) -> float:
        if self.criterion == "gini":
            return gini_impurity(y)
        if self.criterion in {"entropy", "log_loss"}:
            return entropy(y)
        raise ValueError(f"Unknown criterion: {self.criterion}")

    def _best_split(self, X: np.ndarray, y: np.ndarray) -> tuple[int | None, float | None, float]:
        n_samples, n_features = X.shape
        if n_samples < self.min_samples_split:
            return None, None, 0.0

        parent_impurity = self._impurity(y)
        if parent_impurity == 0.0:
            return None, None, 0.0

        k = _resolve_max_features(self.max_features, n_features)
        feature_indices = self._rng.choice(n_features, size=k, replace=False)

        best_gain = 0.0
        best_feature = None
        best_threshold = None

        for feature_index in feature_indices:
            x_col = X[:, feature_index]
            uniq = np.unique(x_col)
            if uniq.size <= 1:
                continue

            thresholds = (uniq[:-1] + uniq[1:]) / 2.0

            for t in thresholds:
                left = x_col <= t
                n_left = int(left.sum())
                n_right = n_samples - n_left

                if n_left < self.min_samples_leaf or n_right < self.min_samples_leaf:
                    continue

                gain = information_gain(y, y[left], y[~left], self._impurity)
                if gain > best_gain:
                    best_gain = gain
                    best_feature = int(feature_index)
                    best_threshold = float(t)

        return best_feature, best_threshold, float(best_gain)

    def _build(self, X: np.ndarray, y: np.ndarray, depth: int) -> Node:
        counts = np.bincount(y, minlength=len(self.classes_))
        proba = counts / counts.sum() if counts.sum() else np.zeros_like(counts, dtype=float)
        predicted_class = int(np.argmax(counts)) if counts.size else 0

        node = Node(value=self.classes_[predicted_class], proba=proba)

        if self.max_depth is not None and depth >= self.max_depth:
            return node
        if y.size < self.min_samples_split:
            return node
        if np.unique(y).size == 1:
            return node

        feature, threshold, gain = self._best_split(X, y)
        if feature is None or threshold is None or gain <= 0.0:
            return node

        left_mask = X[:, feature] <= threshold
        node.feature_index = feature
        node.threshold = threshold
        node.left = self._build(X[left_mask], y[left_mask], depth + 1)
        node.right = self._build(X[~left_mask], y[~left_mask], depth + 1)
        return node

    def fit(self, X: np.ndarray, y: np.ndarray):
        X = np.asarray(X, dtype=float)
        y_raw = np.asarray(y)
        self.classes_, y_enc = np.unique(y_raw, return_inverse=True)
        self.root_ = self._build(X, y_enc, depth=0)
        return self

    def _predict_one(self, x_row: np.ndarray) -> object:
        node = self.root_
        while node is not None and node.feature_index is not None:
            node = node.left if x_row[node.feature_index] <= node.threshold else node.right
        return node.value

    def predict(self, X: np.ndarray) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(row) for row in X])


print('Scratch tree ready')
Scratch tree ready

Visual demo: scratch tree vs sklearn tree (classification)#

We’ll use make_moons (a classic non-linear dataset), and compare:

  • a shallow DecisionTreeClassifierScratch

  • sklearn.tree.DecisionTreeClassifier

We’ll plot the decision boundary.

def plot_decision_boundary(model, X: np.ndarray, y: np.ndarray, title: str, grid_steps: int = 250):
    x_min, x_max = X[:, 0].min() - 0.6, X[:, 0].max() + 0.6
    y_min, y_max = X[:, 1].min() - 0.6, X[:, 1].max() + 0.6

    xs = np.linspace(x_min, x_max, grid_steps)
    ys = np.linspace(y_min, y_max, grid_steps)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.column_stack([xx.ravel(), yy.ravel()])

    z = model.predict(grid).reshape(xx.shape)

    fig = go.Figure()
    fig.add_trace(
        go.Contour(
            x=xs,
            y=ys,
            z=z,
            showscale=False,
            opacity=0.35,
            contours=dict(showlines=False),
            colorscale=[[0, "#7aa6ff"], [1, "#ff8c7a"]],
        )
    )
    fig.add_trace(
        go.Scatter(
            x=X[:, 0],
            y=X[:, 1],
            mode="markers",
            marker=dict(size=6, color=y, colorscale=[[0, "#1f77b4"], [1, "#d62728"]], line=dict(width=0)),
            name="data",
        )
    )
    fig.update_layout(title=title, xaxis_title="x1", yaxis_title="x2")
    return fig


X_moons, y_moons = make_moons(n_samples=450, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, test_size=0.25, random_state=42)

scratch_tree = DecisionTreeClassifierScratch(max_depth=4, min_samples_leaf=5, criterion="gini", random_state=42)
scratch_tree.fit(X_tr, y_tr)

sk_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, criterion="gini", random_state=42)
sk_tree.fit(X_tr, y_tr)

acc_scratch = accuracy_score(y_te, scratch_tree.predict(X_te))
acc_sklearn = accuracy_score(y_te, sk_tree.predict(X_te))

acc_scratch, acc_sklearn
(0.911504424778761, 0.911504424778761)

What did the tree learn?#

A nice sanity check is to print the learned rules.

  • sklearn provides export_text(...)

  • for our scratch tree, we’ll print a tiny readable representation

def print_scratch_tree(node, feature_names=("x1", "x2"), indent=""):
    if node.feature_index is None:
        if node.proba is None:
            print(f"{indent}Predict: {node.value}")
        else:
            p = np.array2string(node.proba, precision=2, floatmode='fixed')
            print(f"{indent}Predict: {node.value}  (proba={p})")
        return

    name = feature_names[node.feature_index] if feature_names else f"x[{node.feature_index}]"
    print(f"{indent}if {name} <= {node.threshold:.3f}:")
    print_scratch_tree(node.left, feature_names=feature_names, indent=indent + "  ")
    print(f"{indent}else:")
    print_scratch_tree(node.right, feature_names=feature_names, indent=indent + "  ")


print("Scratch tree rules:")
print()
print_scratch_tree(scratch_tree.root_)

print()
print("---")
print()
print("sklearn tree rules:")
print()
print(export_text(sk_tree, feature_names=["x1", "x2"]))
Scratch tree rules:

if x2 <= 0.230:
  if x1 <= -0.476:
    Predict: 0  (proba=[1.00 0.00])
  else:
    if x2 <= -0.055:
      if x2 <= -0.352:
        Predict: 1  (proba=[0.00 1.00])
      else:
        Predict: 1  (proba=[0.05 0.95])
    else:
      if x1 <= 1.479:
        Predict: 1  (proba=[0.34 0.66])
      else:
        Predict: 1  (proba=[0.00 1.00])
else:
  if x1 <= 1.378:
    if x2 <= 0.536:
      if x1 <= 0.537:
        Predict: 0  (proba=[0.70 0.30])
      else:
        Predict: 0  (proba=[1.00 0.00])
    else:
      if x2 <= 0.757:
        Predict: 0  (proba=[0.91 0.09])
      else:
        Predict: 0  (proba=[1.00 0.00])
  else:
    Predict: 1  (proba=[0.00 1.00])

---

sklearn tree rules:

|--- x2 <= 0.23
|   |--- x1 <= -0.48
|   |   |--- class: 0
|   |--- x1 >  -0.48
|   |   |--- x2 <= -0.06
|   |   |   |--- x2 <= -0.35
|   |   |   |   |--- class: 1
|   |   |   |--- x2 >  -0.35
|   |   |   |   |--- class: 1
|   |   |--- x2 >  -0.06
|   |   |   |--- x1 <= 1.48
|   |   |   |   |--- class: 1
|   |   |   |--- x1 >  1.48
|   |   |   |   |--- class: 1
|--- x2 >  0.23
|   |--- x1 <= 1.38
|   |   |--- x2 <= 0.54
|   |   |   |--- x1 <= 0.54
|   |   |   |   |--- class: 0
|   |   |   |--- x1 >  0.54
|   |   |   |   |--- class: 0
|   |   |--- x2 >  0.54
|   |   |   |--- x2 <= 0.76
|   |   |   |   |--- class: 0
|   |   |   |--- x2 >  0.76
|   |   |   |   |--- class: 0
|   |--- x1 >  1.38
|   |   |--- class: 1
fig1 = plot_decision_boundary(scratch_tree, X_te, y_te, f"Scratch DecisionTreeClassifier (acc={acc_scratch:.3f})")
fig1.show()

fig2 = plot_decision_boundary(sk_tree, X_te, y_te, f"sklearn DecisionTreeClassifier (acc={acc_sklearn:.3f})")
fig2.show()

4.2 Decision tree regressor (variance reduction)#

For regression, the “purity” is about how tightly clustered the target values are.

  • leaf prediction is usually the mean of \(y\) in that leaf

  • the split tries to reduce weighted variance

A tree regressor produces a piecewise constant function.

class DecisionTreeRegressorScratch:
    def __init__(
        self,
        *,
        max_depth: int | None = None,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        max_features=None,
        random_state: int = 0,
    ):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state

        self.root_: Node | None = None
        self._rng = np.random.default_rng(random_state)

    def _best_split(self, X: np.ndarray, y: np.ndarray) -> tuple[int | None, float | None, float]:
        n_samples, n_features = X.shape
        if n_samples < self.min_samples_split:
            return None, None, 0.0

        parent_var = variance(y)
        if parent_var == 0.0:
            return None, None, 0.0

        k = _resolve_max_features(self.max_features, n_features)
        feature_indices = self._rng.choice(n_features, size=k, replace=False)

        best_reduction = 0.0
        best_feature = None
        best_threshold = None

        for feature_index in feature_indices:
            x_col = X[:, feature_index]
            uniq = np.unique(x_col)
            if uniq.size <= 1:
                continue

            thresholds = (uniq[:-1] + uniq[1:]) / 2.0
            for t in thresholds:
                left = x_col <= t
                n_left = int(left.sum())
                n_right = n_samples - n_left
                if n_left < self.min_samples_leaf or n_right < self.min_samples_leaf:
                    continue

                child_var = (n_left / n_samples) * variance(y[left]) + (n_right / n_samples) * variance(y[~left])
                reduction = parent_var - child_var

                if reduction > best_reduction:
                    best_reduction = float(reduction)
                    best_feature = int(feature_index)
                    best_threshold = float(t)

        return best_feature, best_threshold, float(best_reduction)

    def _build(self, X: np.ndarray, y: np.ndarray, depth: int) -> Node:
        node = Node(value=float(np.mean(y)) if y.size else 0.0)

        if self.max_depth is not None and depth >= self.max_depth:
            return node
        if y.size < self.min_samples_split:
            return node
        if variance(y) == 0.0:
            return node

        feature, threshold, reduction = self._best_split(X, y)
        if feature is None or threshold is None or reduction <= 0.0:
            return node

        left_mask = X[:, feature] <= threshold
        node.feature_index = feature
        node.threshold = threshold
        node.left = self._build(X[left_mask], y[left_mask], depth + 1)
        node.right = self._build(X[~left_mask], y[~left_mask], depth + 1)
        return node

    def fit(self, X: np.ndarray, y: np.ndarray):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.root_ = self._build(X, y, depth=0)
        return self

    def _predict_one(self, x_row: np.ndarray) -> float:
        node = self.root_
        while node is not None and node.feature_index is not None:
            node = node.left if x_row[node.feature_index] <= node.threshold else node.right
        return float(node.value)

    def predict(self, X: np.ndarray) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(row) for row in X], dtype=float)


print('Scratch regressor ready')
Scratch regressor ready
# 1D regression dataset: nonlinear signal
n = 260
x_reg = np.sort(rng.uniform(-3, 3, size=n))
y_reg = np.sin(x_reg) + 0.25 * rng.normal(size=n)

X_reg = x_reg.reshape(-1, 1)

fig = px.scatter(x=x_reg, y=y_reg, title="Regression dataset: y = sin(x) + noise", labels={"x": "x", "y": "y"})
fig.show()
# Fit trees with different depths to show piecewise-constant behavior
x_grid = np.linspace(x_reg.min(), x_reg.max(), 600)
X_grid = x_grid.reshape(-1, 1)

depths = [1, 2, 3, 5, 8]
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_reg, y=y_reg, mode="markers", name="data", marker=dict(size=5)))

for d in depths:
    tree = DecisionTreeRegressorScratch(max_depth=d, min_samples_leaf=6, random_state=42)
    tree.fit(X_reg, y_reg)
    y_hat = tree.predict(X_grid)
    fig.add_trace(go.Scatter(x=x_grid, y=y_hat, mode="lines", name=f"depth={d}"))

fig.update_layout(title="Scratch regression trees: deeper = more flexible", xaxis_title="x", yaxis_title="prediction")
fig.show()

Compare with sklearn (regression)#

The overall shape should look similar: deeper trees fit more wiggles.

# sklearn decision tree regressor on the same 1D regression data
from sklearn.tree import DecisionTreeRegressor

fig = go.Figure()
fig.add_trace(go.Scatter(x=x_reg, y=y_reg, mode="markers", name="data", marker=dict(size=5)))

for d in [1, 2, 3, 5, 8]:
    m = DecisionTreeRegressor(max_depth=d, min_samples_leaf=6, random_state=42)
    m.fit(X_reg, y_reg)
    fig.add_trace(go.Scatter(x=x_grid, y=m.predict(X_grid), mode="lines", name=f"sk depth={d}"))

fig.update_layout(title="sklearn regression trees", xaxis_title="x", yaxis_title="prediction")
fig.show()

Random forest regressor (variance reduction in regression)#

A single regression tree can be very “blocky”. A random forest regressor averages many trees and often produces a smoother approximation.

# Train/test split for the regression task
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_reg, y_reg, test_size=0.25, random_state=42)

single_reg = DecisionTreeRegressor(max_depth=8, min_samples_leaf=3, random_state=42)
rf_reg = RandomForestRegressor(
    n_estimators=400,
    max_depth=None,
    max_features=1.0,
    min_samples_leaf=3,
    random_state=42,
)

single_reg.fit(Xr_tr, yr_tr)
rf_reg.fit(Xr_tr, yr_tr)

mse_single = mean_squared_error(yr_te, single_reg.predict(Xr_te))
mse_rf = mean_squared_error(yr_te, rf_reg.predict(Xr_te))

mse_single, mse_rf
(0.07297656765298927, 0.06900414442614336)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_reg, y=y_reg, mode="markers", name="data", marker=dict(size=5)))

fig.add_trace(go.Scatter(x=x_grid, y=single_reg.predict(X_grid), mode="lines", name=f"single tree (MSE={mse_single:.3f})"))
fig.add_trace(go.Scatter(x=x_grid, y=rf_reg.predict(X_grid), mode="lines", name=f"random forest (MSE={mse_rf:.3f})"))

fig.update_layout(title="Regression: single tree vs random forest", xaxis_title="x", yaxis_title="prediction")
fig.show()

5) scikit-learn trees + key hyperparameters#

scikit-learn uses an optimized CART implementation.

Common parameters:

  • criterion:

    • classifier: "gini" or "entropy" (also "log_loss" in newer versions)

    • regressor: "squared_error" (variance reduction)

  • max_depth: maximum depth (prevents overfitting)

  • min_samples_split, min_samples_leaf: prevent overly small leaves

  • max_features: random subset of features per split (key for random forests)

  • ccp_alpha: cost-complexity pruning (post-pruning)

Below is a quick look at a trained sklearn tree in text form.

sk_tree_small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, criterion="gini", random_state=42)
sk_tree_small.fit(X_tr, y_tr)

print(export_text(sk_tree_small, feature_names=["x1", "x2"]))
|--- x2 <= 0.23
|   |--- x1 <= -0.48
|   |   |--- class: 0
|   |--- x1 >  -0.48
|   |   |--- x2 <= -0.06
|   |   |   |--- class: 1
|   |   |--- x2 >  -0.06
|   |   |   |--- class: 1
|--- x2 >  0.23
|   |--- x1 <= 1.38
|   |   |--- x2 <= 0.54
|   |   |   |--- class: 0
|   |   |--- x2 >  0.54
|   |   |   |--- class: 0
|   |--- x1 >  1.38
|   |   |--- class: 1

6) Ensembles: bagging and random forests#

A single decision tree is a high-variance learner:

  • small data changes can produce a different tree

Bagging (“bootstrap aggregating”) reduces variance:

  1. sample many bootstrap datasets

  2. train a model on each

  3. average (regression) or majority vote (classification)

Random forests add another trick:

  • at each split, only consider a random subset of features (max_features)

This decorrelates the trees, making the average stronger.

Analogy:

  • One expert can be inconsistent.

  • A panel of experts who don’t all think the same way tends to be more reliable.
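The "panel of experts" intuition can be checked numerically. For n predictors with unit variance and pairwise correlation ρ, the variance of their average is exactly (1 + (n-1)ρ)/n = ρ + (1-ρ)/n: adding more experts only shrinks the second term, while decorrelating them shrinks the first. A quick NumPy simulation (a standalone sketch, not tied to trees):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_variance(rho: float, n_models: int = 25, n_trials: int = 100_000) -> float:
    """Empirical variance of the mean of n_models equicorrelated unit-variance predictors."""
    # Equicorrelated Gaussians: a shared component plus independent noise.
    shared = rng.normal(size=(n_trials, 1))
    own = rng.normal(size=(n_trials, n_models))
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return float(preds.mean(axis=1).var())

for rho in [0.0, 0.3, 0.7, 1.0]:
    theory = rho + (1 - rho) / 25
    print(f"rho={rho:.1f}  empirical={avg_variance(rho):.3f}  theory={theory:.3f}")
```

This is exactly why random forests restrict max_features: bootstrapping alone leaves the trees fairly correlated, and the feature subsampling pushes ρ down.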

class RandomForestClassifierScratch:
    def __init__(
        self,
        *,
        n_estimators: int = 50,
        max_depth: int | None = None,
        min_samples_split: int = 2,
        min_samples_leaf: int = 1,
        max_features="sqrt",
        random_state: int = 0,
    ):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.random_state = random_state

        self.trees_: list[DecisionTreeClassifierScratch] = []
        self._rng = np.random.default_rng(random_state)

    def fit(self, X: np.ndarray, y: np.ndarray):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        n_samples = X.shape[0]

        self.trees_ = []
        for _ in range(self.n_estimators):
            idx = self._rng.integers(0, n_samples, size=n_samples)
            tree = DecisionTreeClassifierScratch(
                criterion="gini",
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                max_features=self.max_features,
                random_state=int(self._rng.integers(0, 1_000_000_000)),
            )
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        X = np.asarray(X, dtype=float)
        preds = np.vstack([tree.predict(X) for tree in self.trees_])

        # Majority vote per column
        out = []
        for j in range(preds.shape[1]):
            values, counts = np.unique(preds[:, j], return_counts=True)
            out.append(values[int(np.argmax(counts))])
        return np.array(out)


# Compare: single tree vs scratch forest
single = DecisionTreeClassifierScratch(max_depth=6, min_samples_leaf=2, random_state=42)
single.fit(X_tr, y_tr)

forest = RandomForestClassifierScratch(n_estimators=120, max_depth=8, min_samples_leaf=2, max_features="sqrt", random_state=42)
forest.fit(X_tr, y_tr)

acc_single = accuracy_score(y_te, single.predict(X_te))
acc_forest = accuracy_score(y_te, forest.predict(X_te))

acc_single, acc_forest
(0.8938053097345132, 0.9203539823008849)
fig = plot_decision_boundary(single, X_te, y_te, f"Scratch single tree (acc={acc_single:.3f})")
fig.show()

fig = plot_decision_boundary(forest, X_te, y_te, f"Scratch random forest (acc={acc_forest:.3f})")
fig.show()

Random forests in scikit-learn#

Key parameters you’ll actually tune:

  • n_estimators: number of trees

  • max_depth: tree depth

  • max_features: features per split ("sqrt" is a common default)

  • min_samples_leaf: makes trees less “spiky”

  • bootstrap: whether to bootstrap samples

  • oob_score: out-of-bag estimate (free-ish validation when bootstrapping)

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    max_features="sqrt",
    min_samples_leaf=2,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
rf.fit(X_tr, y_tr)

acc_rf = accuracy_score(y_te, rf.predict(X_te))

acc_rf, rf.oob_score_
(0.9203539823008849, 0.9287833827893175)
# Feature importances (on this synthetic dataset they're not very meaningful, but the API is)
fig = px.bar(x=["x1", "x2"], y=rf.feature_importances_, title="RandomForest feature_importances_")
fig.update_layout(xaxis_title="feature", yaxis_title="importance")
fig.show()

7) Gradient boosting and modern libraries#

Bagging/forests mainly reduce variance.

Boosting mainly reduces bias.

The tutoring analogy#

Imagine a student learning math:

  • Tutor #1 explains the basics.

  • Tutor #2 focuses on what the student still gets wrong.

  • Tutor #3 focuses on the remaining mistakes.

Boosting is like that: it builds an additive model

\[ F_M(x) = \sum_{m=1}^M \eta\, f_m(x) \]

where each new tree \(f_m\) is trained to reduce the current errors.
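For squared-error regression, "reduce the current errors" has a very concrete form: the negative gradient of \(\frac{1}{2}(y - F(x))^2\) is just the residual \(y - F(x)\), so each new tree is fit to the residuals of the current ensemble. A minimal from-scratch sketch (using sklearn's DecisionTreeRegressor as the base learner; the function names here are our own):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_regression(X, y, n_trees=100, lr=0.1, max_depth=2):
    """Minimal gradient boosting for squared loss: each tree fits the residuals."""
    pred = np.full(y.shape, y.mean())  # F_0 = constant prediction
    trees = []
    for _ in range(n_trees):
        resid = y - pred                              # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)
        pred = pred + lr * tree.predict(X)            # F_m = F_{m-1} + eta * f_m
        trees.append(tree)
    return y.mean(), trees

def boost_predict(X, base, trees, lr=0.1):
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + lr * tree.predict(X)
    return pred

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.sin(x) + 0.2 * rng.normal(size=200)
base, trees = boost_regression(x.reshape(-1, 1), y)
mse = np.mean((boost_predict(x.reshape(-1, 1), base, trees) - y) ** 2)
print(f"train MSE after boosting: {mse:.4f}")
```

Real libraries add regularization, shrinkage schedules, subsampling, and (for non-squared losses) gradient/Hessian statistics, but the residual-fitting loop is the core idea.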

Modern boosting libraries:

  • XGBoost: strong regularization, second-order optimization, robust defaults, great for tabular

  • LightGBM: histogram + leaf-wise growth, very fast on large data

  • CatBoost: excellent categorical handling (ordered target statistics), strong out-of-the-box performance

We’ll demonstrate boosting with sklearn and then map the ideas to the libraries.

# Boosting demo (sklearn HistGradientBoostingClassifier)
# Note: this model can work without scaling; we keep features as-is.

X_tr2, X_val2, y_tr2, y_val2 = train_test_split(X_tr, y_tr, test_size=0.25, random_state=7)

gb = HistGradientBoostingClassifier(
    max_depth=3,
    learning_rate=0.1,
    max_iter=250,
    random_state=42,
)

gb.fit(X_tr2, y_tr2)

val_losses = []
for proba in gb.staged_predict_proba(X_val2):
    val_losses.append(log_loss(y_val2, proba))

best_iter = int(np.argmin(val_losses) + 1)

fig = px.line(x=np.arange(1, len(val_losses) + 1), y=val_losses, title=f"Boosting: validation log loss (best_iter={best_iter})")
fig.update_layout(xaxis_title="# trees (iterations)", yaxis_title="log loss")
fig.show()

best_iter
47

XGBoost vs LightGBM vs CatBoost (high-level)#

All three are gradient-boosted decision tree (GBDT) systems, but they differ in engineering choices.

Tree growth strategy#

  • XGBoost (often): level-wise growth (more balanced)

  • LightGBM: typically leaf-wise growth (can be faster/more accurate, can overfit if unconstrained)

  • CatBoost: commonly uses symmetric / oblivious trees (same split per depth level)

Categorical features#

  • CatBoost: built-in strong categorical handling (ordered target statistics)

  • XGBoost / LightGBM: can handle categoricals in newer versions/modes, but many workflows still rely on encoding

Practical hyperparameter map (mental model)#

| Concept | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| # trees | n_estimators / num_boost_round | num_iterations | iterations |
| step size | learning_rate (eta) | learning_rate | learning_rate |
| tree size | max_depth, min_child_weight | num_leaves, min_data_in_leaf | depth, min_data_in_leaf |
| row sampling | subsample | bagging_fraction | subsample |
| col sampling | colsample_bytree | feature_fraction | rsm |
| L2 reg | reg_lambda | lambda_l2 | l2_leaf_reg |
| L1 reg | reg_alpha | lambda_l1 | (limited / different) |

Rule of thumb:

  • If you increase tree complexity, counterbalance with stronger regularization and/or smaller learning rate.

# Optional: check whether the big boosting libs are installed in this environment.
# (This notebook still makes sense if they're not installed.)

libs = {
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "catboost": "catboost",
}

for name, mod in libs.items():
    try:
        m = __import__(mod)
        ver = getattr(m, "__version__", "(unknown version)")
        print(f"{name}: installed ({ver})")
    except Exception as e:
        print(f"{name}: not installed ({type(e).__name__}: {e})")
xgboost: not installed (ModuleNotFoundError: No module named 'xgboost')
lightgbm: not installed (ModuleNotFoundError: No module named 'lightgbm')
catboost: not installed (ModuleNotFoundError: No module named 'catboost')

Exercises#

  1. Change max_depth of a tree and visualize how the decision boundary changes.

  2. Increase noise in make_moons and compare single tree vs forest.

  3. For boosting, try learning_rate=0.05 with more max_iter and see the validation curve.

References#

  • ISLR (James, Witten, Hastie, Tibshirani)

  • ESL (Hastie, Tibshirani, Friedman)

  • sklearn documentation: DecisionTree, RandomForest, HistGradientBoosting

  • XGBoost / LightGBM / CatBoost official docs